In [1]:
import pegasus as pg

Count Matrix File

For this tutorial, we provide a count matrix dataset of Human Bone Marrow samples from 8 donors. It is stored in the pegasus hdf5 format (file extension ".h5sc").

Download tutorial data

In [2]:
!wget https://github.com/klarman-cell-observatory/scCloud-tutorial-data/raw/master/MantonBM_nonmix_subset.h5sc
--2019-09-24 12:36:37--  https://github.com/klarman-cell-observatory/scCloud-tutorial-data/raw/master/MantonBM_nonmix_subset.h5sc
Resolving github.com (github.com)... 192.30.253.113
Connecting to github.com (github.com)|192.30.253.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/klarman-cell-observatory/scCloud-tutorial-data/master/MantonBM_nonmix_subset.h5sc [following]
--2019-09-24 12:36:37--  https://raw.githubusercontent.com/klarman-cell-observatory/scCloud-tutorial-data/master/MantonBM_nonmix_subset.h5sc
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 64346689 (61M) [application/octet-stream]
Saving to: ‘MantonBM_nonmix_subset.h5sc.1’

MantonBM_nonmix_sub 100%[===================>]  61.37M  7.87MB/s    in 8.1s    

2019-09-24 12:36:45 (7.59 MB/s) - ‘MantonBM_nonmix_subset.h5sc.1’ saved [64346689/64346689]

When the download is finished, you can load the file using the pegasus read_input function:

In [3]:
adata = pg.read_input("MantonBM_nonmix_subset.h5sc")
adata
2019-09-24 12:36:47,576 - sccloud - INFO - Read input is finished. Time spent = 1.51s.
Out[3]:
AnnData object with n_obs × n_vars = 48000 × 33694 
    obs: 'Channel'
    var: 'gene_ids'
    uns: 'genome'

The count matrix is represented as an Annotated Data Matrix structure (see anndata.AnnData for details):

It has 5 major parts:

  • Raw count matrix: adata.X, a SciPy sparse matrix whose rows are cell barcodes and whose columns are genes/features:
In [4]:
adata.X
Out[4]:
<48000x33694 sparse matrix of type '<class 'numpy.float32'>'
	with 38727916 stored elements in Compressed Sparse Row format>

This dataset contains $48,000$ barcodes and $33,694$ genes.

  • Cell barcode attributes: adata.obs, a Pandas data frame with barcodes as the index. For now, there is only one attribute, "Channel", referring to the donor from which the cell comes:
In [5]:
adata.obs.head()
Out[5]:
Channel
MantonBM1_HiSeq_1-AAACCTGAGCAGGTCA MantonBM1_HiSeq_1
MantonBM1_HiSeq_1-AAACCTGCACACTGCG MantonBM1_HiSeq_1
MantonBM1_HiSeq_1-AAACCTGCACCGGAAA MantonBM1_HiSeq_1
MantonBM1_HiSeq_1-AAACCTGCATAGACTC MantonBM1_HiSeq_1
MantonBM1_HiSeq_1-AAACCTGCATCGATGT MantonBM1_HiSeq_1
In [6]:
adata.obs['Channel'].value_counts()
Out[6]:
MantonBM6_HiSeq_1    6000
MantonBM1_HiSeq_1    6000
MantonBM4_HiSeq_1    6000
MantonBM7_HiSeq_1    6000
MantonBM2_HiSeq_1    6000
MantonBM3_HiSeq_1    6000
MantonBM8_HiSeq_1    6000
MantonBM5_HiSeq_1    6000
Name: Channel, dtype: int64

From the summary above, you can see that each of the 8 donors (numbered 1 to 8) contributes $6,000$ cells.

  • Gene attributes: adata.var, also a Pandas data frame, with gene names as the index. For now, it has only one attribute, "gene_ids", referring to the unique gene ID in the experiment:
In [7]:
adata.var.head()
Out[7]:
gene_ids
RP11-34P13.3 ENSG00000243485
FAM138A ENSG00000237613
OR4F5 ENSG00000186092
RP11-34P13.7 ENSG00000238009
RP11-34P13.8 ENSG00000239945
  • Unstructured information: adata.uns, a Python dictionary. It usually stores information about the whole dataset rather than individual barcodes or features, such as the genome reference:
In [8]:
adata.uns['genome']
Out[8]:
'GRCh38'
  • Finally, embedding attributes on cell barcodes: adata.obsm, as well as on genes: adata.varm. We'll see them in later sections.

Preprocessing

Filtration

The first step in preprocessing is to perform quality control analysis and remove cells and genes of low quality.

We can generate QC metrics using the following method with default settings:

In [9]:
pg.qc_metrics(adata)

The metrics considered are:

  • Number of genes: by default, keep cells with $500 \leq \text{# Genes} < 6000$;
  • Number of UMIs: by default, keep cells with $100 \leq \text{# UMIs} < 600,000$;
  • Percent of Mitochondrial genes: by default, keep cells with percent $< 10\%$;
  • Percent of cells: by default, genes that are expressed in at least $0.05\%$ of cells are marked as robust.

For details on customizing your own thresholds, see documentation.
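To make the metrics above concrete, here is a minimal NumPy/SciPy sketch that computes them on a toy count matrix and applies the default thresholds. This is an illustration only, not pegasus's implementation; the "MT-" prefix for mitochondrial genes and the toy data are assumptions:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy count matrix: 3 cells x 4 genes (illustrative, not the tutorial dataset)
genes = np.array(["CD3D", "LYZ", "MT-CO1", "MT-ND1"])
X = csr_matrix(np.array([[5, 0, 1, 0],
                         [0, 3, 2, 5],
                         [1, 1, 0, 0]]))

n_genes = (X > 0).sum(axis=1).A1              # genes detected per cell
n_umis = X.sum(axis=1).A1                     # total UMIs per cell
mito_cols = np.where(np.char.startswith(genes, "MT-"))[0]
percent_mito = 100.0 * X[:, mito_cols].sum(axis=1).A1 / n_umis

# Apply the default thresholds listed above
passed_qc = ((n_genes >= 500) & (n_genes < 6000) &
             (n_umis >= 100) & (n_umis < 600000) &
             (percent_mito < 10.0))
```

On this toy matrix every cell fails QC, since real cells carry hundreds of detected genes while the toy cells carry only a few.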

Numeric summaries of the filtration of cell barcodes and genes can be obtained via the get_filter_stats method:

In [10]:
stats_samples, stats_genes = pg.get_filter_stats(adata)

The results are two pandas data frames, one for samples, the other for genes:

In [11]:
stats_samples
Out[11]:
kept median_n_genes median_n_umis median_percent_mito filt total median_n_genes_before median_n_umis_before median_percent_mito_before
Channel
MantonBM5_HiSeq_1 3956 752 2752.0 3.231436 2044 6000 623.0 2051.0 3.607993
MantonBM4_HiSeq_1 4067 773 2252.0 3.344811 1933 6000 657.0 1727.5 3.651207
MantonBM3_HiSeq_1 4161 757 2568.0 3.371369 1839 6000 673.0 2058.5 3.668615
MantonBM1_HiSeq_1 4294 766 2472.0 3.845610 1706 6000 687.0 2024.5 4.067210
MantonBM7_HiSeq_1 4330 723 2354.0 3.171049 1670 6000 656.0 1991.5 3.293506
MantonBM8_HiSeq_1 4400 719 2515.5 3.623398 1600 6000 654.5 2170.5 3.837770
MantonBM6_HiSeq_1 4580 827 2635.0 3.125128 1420 6000 757.5 2263.0 3.385622
MantonBM2_HiSeq_1 4866 782 2430.5 3.616962 1134 6000 731.5 2181.5 3.656562
In [12]:
stats_genes
Out[12]:
n_cells percent_cells
gene
LINC01281 17 0.049056
RP11-268P4.4 17 0.049056
RP13-216E22.4 17 0.049056
CTD-2184C24.2 17 0.049056
RP1-150O5.3 17 0.049056
... ... ...
RP1-137H15.2 0 0.000000
ZIC3 0 0.000000
RP6-27P15.2 0 0.000000
FGF13-AS1 0 0.000000
FAM231B 0 0.000000

17186 rows × 2 columns

You can also check the QC stats via plots. The first is a violin plot:

In [13]:
pg.violin(adata, keys = ['n_genes', 'n_counts', 'percent_mito'], by = 'passed_qc')
Out[13]:

Cells marked with passed_qc == True are those passing QC, while the others will be filtered out.

Next is a scatter plot of the QC metrics:

In [14]:
pg.scatter(adata, 'n_genes', 'n_counts', color = 'passed_qc')
Out[14]:

This plot indicates that the number of UMIs and the number of genes are positively correlated.

In [15]:
pg.scatter(adata, 'n_genes', 'percent_mito', color = 'passed_qc')
Out[15]:

The plot above indicates that cells with a high percentage of mitochondrial genes tend to have fewer genes and UMIs detected.

Now filter the data based on the QC metrics set in qc_metrics:

In [16]:
pg.filter_data(adata)
adata
2019-09-24 12:36:53,129 - sccloud - INFO - After filteration, 34654 cells and 23460 genes are kept. Among 23460 genes, 16508 genes are robust.
Out[16]:
AnnData object with n_obs × n_vars = 34654 × 23460 
    obs: 'Channel', 'passed_qc', 'n_genes', 'n_counts', 'percent_mito'
    var: 'gene_ids', 'n_cells', 'percent_cells', 'robust', 'highly_variable_features'
    uns: 'genome'

After filtration, $34,654$ cells ($72.20\%$) and $23,460$ genes ($69.63\%$) are kept.

We can also view the number of cells per donor after filtration:

In [17]:
adata.obs['Channel'].value_counts()
Out[17]:
MantonBM2_HiSeq_1    4866
MantonBM6_HiSeq_1    4580
MantonBM8_HiSeq_1    4400
MantonBM7_HiSeq_1    4330
MantonBM1_HiSeq_1    4294
MantonBM3_HiSeq_1    4161
MantonBM4_HiSeq_1    4067
MantonBM5_HiSeq_1    3956
Name: Channel, dtype: int64

Normalization and Logarithmic Transformation

After filtration, we first normalize each cell so that its counts sum to the same total (default $10^5$, see documentation), and then transform the data into logarithmic space via $\log(x + 1)$ to dampen extreme values:

In [18]:
pg.log_norm(adata)
2019-09-24 12:36:54,691 - sccloud - INFO - Normalization is finished. Time spent = 1.54s.
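The two steps can be sketched in plain NumPy on toy data (a simplified illustration; pegasus's log_norm may differ in details, e.g. which genes enter the scaling):

```python
import numpy as np

# Toy counts: 2 cells x 3 genes (illustrative, not the tutorial dataset)
counts = np.array([[10.0, 0.0, 90.0],
                   [ 1.0, 3.0,  6.0]])

scale = 1e5                                   # target total count per cell
norm = counts / counts.sum(axis=1, keepdims=True) * scale
logged = np.log1p(norm)                       # log(x + 1), safe at zero counts

# Every cell now has the same total count before the log transform
assert np.allclose(norm.sum(axis=1), scale)
```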

For the downstream analysis, we may want to make a copy of the count matrix, in case we need to come back to this step and redo the analysis:

In [19]:
adata_trial = adata.copy()

Highly Variable Gene Selection

Highly Variable Genes (HVG) are more likely to convey information discriminating different cell types and states. Thus, rather than considering all genes, people usually focus on selected HVGs for downstream analyses.

Set the consider_batch flag to decide whether to account for batch effects. At this step, set it to False:

In [20]:
pg.highly_variable_features(adata_trial, consider_batch = False)
2019-09-24 12:36:55,750 - sccloud - INFO - 2000 highly variable features have been selected. Time spent = 0.89s.

By default, we select 2000 HVGs using the pegasus selection method. Alternatively, you can choose the traditional method that both Seurat and SCANPY use by setting flavor = 'Seurat'. See documentation for details.

Below is a scatter plot of highly variable genes for this dataset. Each point stands for one gene; orange points are the selected highly variable genes, which account for the majority of the variation in the dataset.

In [21]:
pg.variable_feature_plot(adata_trial)
Out[21]:

Below are the top 5 HVGs:

In [22]:
adata_trial.var.loc[adata_trial.var['highly_variable_features']].sort_values(by = 'hvf_rank').head()
Out[22]:
gene_ids n_cells percent_cells robust highly_variable_features mean var hvf_loess hvf_rank
LYZ ENSG00000090382 8918 25.734403 True True 1.572100 8.107291 3.861553 3
S100A9 ENSG00000163220 8638 24.926415 True True 1.476007 7.648051 3.727059 5
S100A8 ENSG00000143546 8146 23.506666 True True 1.383504 7.237579 3.547860 7
HLA-DRA ENSG00000204287 14683 42.370289 True True 2.261641 7.525434 4.239100 12
GNLY ENSG00000115523 5464 15.767300 True True 0.922636 4.958139 2.597217 14

Principal Component Analysis

To reduce the dimension of the data, Principal Component Analysis (PCA) is widely used. Briefly speaking, PCA transforms the data from the original dimensions into a much smaller set of Principal Components (PCs). The dimension is reduced, while the PCs still cover a majority of the variation in the data. Moreover, the new dimensions (i.e. PCs) are independent of each other.

pegasus uses the following method to perform PCA:

In [23]:
pg.pca(adata_trial)
2019-09-24 12:36:58,561 - sccloud - INFO - PCA is done. Time spent = 1.85s.

By default, pg.pca uses:

  • Before PCA, scale the data to the standard normal distribution $N(0, 1)$ and truncate values at a maximum of $10$;
  • Number of PCs to compute: 50;
  • Apply PCA only to highly variable features.

See its documentation for customization.

To explain the meaning of PCs, let's look at the first PC (denoted $PC_1$), which covers the most variation:

In [24]:
coord_pc1 = adata_trial.uns['PCs'][:, 0]
coord_pc1
Out[24]:
array([ 0.02267287,  0.01808362, -0.0057981 , ..., -0.00876948,
        0.04856516,  0.03503553], dtype=float32)

This is an array of 2000 elements, each of which is a coefficient corresponding to one HVG.

The HVGs are the following:

In [25]:
adata_trial.var.loc[adata_trial.var['highly_variable_features']].index.values
Out[25]:
array(['HES4', 'ISG15', 'TNFRSF18', ..., 'S100B', 'MT-CO1', 'MT-CO3'],
      dtype=object)

$PC_1$ is computed by

\begin{equation*} PC_1 = \text{coord_pc1}[0] \cdot \text{HES4} + \text{coord_pc1}[1] \cdot \text{ISG15} + \text{coord_pc1}[2] \cdot \text{TNFRSF18} + \cdots + \text{coord_pc1}[1997] \cdot \text{S100B} + \text{coord_pc1}[1998] \cdot \text{MT-CO1} + \text{coord_pc1}[1999] \cdot \text{MT-CO3} \end{equation*}

Therefore, all the 50 PCs are the linear combinations of the 2000 HVGs.
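This linear-combination view can be verified on toy data: center the matrix, take the top right singular vector as the loadings, and project. The dot product over genes reproduces the PC1 scores. A minimal NumPy sketch (toy data, not the tutorial dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # toy scaled matrix: cells x HVGs

Xc = X - X.mean(axis=0)                # center each gene
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

coord_pc1 = Vt[0]                      # one coefficient (loading) per gene
pc1_scores = Xc @ coord_pc1            # PC1 = weighted sum over genes

# Matches the first column of the usual PCA embedding U * S
assert np.allclose(pc1_scores, (U * S)[:, 0])
```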

The calculated PCA matrix is stored in the obsm field, which is the first embedding object we have:

In [26]:
adata_trial.obsm['X_pca'].shape
Out[26]:
(34654, 50)

For each of the $34,654$ cells, its expression is now represented by 50 PCs instead of 2000 HVGs.

Nearest Neighbors

All downstream analyses, including clustering and visualization, require a k-Nearest-Neighbor (kNN) graph on cells. We can build such a graph using the neighbors method:

In [27]:
pg.neighbors(adata_trial)
2019-09-24 12:37:06,672 - sccloud - INFO - Nearest neighbor search is finished in 8.08s.
2019-09-24 12:37:08,037 - sccloud - INFO - Constructing affinity matrix is done. Time spent = 1.36s.
2019-09-24 12:37:08,043 - sccloud - INFO - Affinity matrix calculation is finished in 1.37s

It uses the default setting:

  • For each cell, calculate its 100 nearest neighbors;
  • Use PCA matrix for calculation;
  • Use L2 distance as the metric;
  • Use the hnswlib search algorithm to compute approximate nearest neighbors efficiently.

See its documentation for customization.
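Conceptually, the search finds, for each cell, the K smallest L2 distances to the other cells. A brute-force NumPy sketch on toy data (hnswlib does this approximately and far faster; this is an illustration, not pegasus's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))          # toy PCA matrix: cells x PCs
K = 5                                  # neighbors per cell (pegasus default: 100)

# Pairwise squared L2 distances; exclude each cell itself
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
np.fill_diagonal(d2, np.inf)

# K nearest neighbors per cell, sorted from nearest to farthest
knn_indices = np.argsort(d2, axis=1)[:, :K]
knn_distances = np.sqrt(np.take_along_axis(d2, knn_indices, axis=1))
```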

Below is the result:

In [28]:
print("Get {} nearest neighbors (excluding itself) for each cell.".format(adata_trial.uns['pca_knn_indices'].shape[1]))
adata_trial.uns['pca_knn_indices']
Get 99 nearest neighbors (excluding itself) for each cell.
Out[28]:
array([[28469,  1133, 33155, ..., 19789, 28167,  2377],
       [29551,  1819,  2873, ...,  3754, 27307, 21255],
       [ 2777,    29, 19111, ...,  6551, 32843,  9781],
       ...,
       [32867, 30753, 20524, ..., 32610, 20148, 20084],
       [31402, 30381, 17216, ...,  8459,  9244, 16175],
       [10317, 34593, 32525, ...,  3981, 28426,  4620]])
In [29]:
adata_trial.uns['pca_knn_distances']
Out[29]:
array([[ 4.8436975,  5.1501913,  5.3732085, ...,  6.464017 ,  6.4657097,
         6.4672294],
       [ 7.1352835,  7.3558335,  7.780309 , ...,  9.517456 ,  9.520907 ,
         9.521029 ],
       [ 5.158068 ,  5.263183 ,  5.4073806, ...,  6.503496 ,  6.503737 ,
         6.5085793],
       ...,
       [ 6.520121 ,  6.7687097,  6.8119335, ...,  9.087468 ,  9.087721 ,
         9.089647 ],
       [ 8.348032 ,  9.3632345,  9.794904 , ..., 11.772722 , 11.786986 ,
        11.796101 ],
       [ 7.4894295,  7.698165 ,  7.7886786, ..., 11.34852  , 11.353077 ,
        11.432337 ]], dtype=float32)

Each row corresponds to one cell, listing its neighbors (not including itself) from nearest to farthest. adata_trial.uns['pca_knn_indices'] stores their indices, and adata_trial.uns['pca_knn_distances'] stores distances.

Clustering and Visualization

Now we are ready to cluster the data for cell type detection. pegasus provides 4 clustering algorithms. In this tutorial, we use the Louvain algorithm:

In [30]:
pg.louvain(adata_trial)
2019-09-24 12:37:09,762 - sccloud - INFO - Graph is constructed. Time spent = 1.70s.
2019-09-24 12:37:28,717 - sccloud - INFO - Louvain clustering is done. Time spent = 20.65s.

As a result, the Louvain algorithm finds 19 clusters:

In [31]:
adata_trial.obs['louvain_labels'].value_counts()
Out[31]:
1     4898
2     4256
3     4173
4     3494
5     2867
6     2484
7     2158
8     1929
9     1613
10    1269
11    1026
12     915
13     838
14     583
15     567
16     444
17     383
18     382
19     375
Name: louvain_labels, dtype: int64

We can check each cluster's composition regarding donors via a composition plot:

In [32]:
pg.composition_plot(adata_trial, by = 'louvain_labels', condition = 'Channel')
Out[32]:

However, we can see a clear batch effect in the plot: e.g. Clusters 10 and 13 consist mostly of cells from Donor 3.

We can see it more clearly in the FIt-SNE plot (a visualization algorithm we will discuss later):

In [33]:
pg.fitsne(adata_trial)
2019-09-24 12:38:16,227 - sccloud - INFO - FIt-SNE is calculated. Time spent = 47.23s.
In [34]:
pg.embedding(adata_trial, basis = 'fitsne', keys = ['louvain_labels', 'Channel'])
Out[34]:

Batch Correction

Batch effects occur when data samples are generated under different conditions, such as date, lab setting, or equipment. Unless informed that all samples were generated under similar conditions, one may suspect batch effects upon seeing a visualization in which samples are largely isolated from each other.

For this dataset, we apply a batch correction step to reduce the batch effect observed in the plot above:

In [35]:
pg.highly_variable_features(adata, consider_batch = True)
pg.correct_batch(adata, features = 'highly_variable_features')
2019-09-24 12:38:18,964 - sccloud - INFO - Estimation on feature statistics per channel is finished. Time spent = 1.37s.
2019-09-24 12:38:19,037 - sccloud - INFO - 2000 highly variable features have been selected. Time spent = 1.45s.
2019-09-24 12:38:19,049 - sccloud - INFO - Adjustment parameters are estimated.
2019-09-24 12:38:19,421 - sccloud - INFO - Features are selected.
2019-09-24 12:38:20,065 - sccloud - INFO - Batch correction is finished. Time spent = 0.65s.
In [36]:
adata.uns['fmat_highly_variable_features'].shape
Out[36]:
(34654, 2000)

The batch correction result is stored at adata.uns['fmat_highly_variable_features']. By default, it uses adata.obs['Channel'] as the batch key, i.e. there are 8 batches in this dataset.

See its documentation for customization. Moreover, if you want to use some other cell attribute(s) as the batch key, consider using the pg.set_group_attribute method (see details).
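Conceptually, batch correction aligns per-gene statistics across batches. Below is a minimal sketch of a location (mean-shift) adjustment on toy data; pegasus's correct_batch estimates its own adjustment parameters per channel, so treat this as an illustration of the idea, not its implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two toy batches with a systematic per-gene shift between them
X = np.vstack([rng.normal(2.0, 1.0, size=(30, 4)),   # batch "A"
               rng.normal(0.0, 1.0, size=(30, 4))])  # batch "B"
batch = np.array(["A"] * 30 + ["B"] * 30)

corrected = X.copy()
overall_mean = X.mean(axis=0)
for b in np.unique(batch):
    idx = batch == b
    # Shift this batch's per-gene mean onto the overall mean
    corrected[idx] += overall_mean - X[idx].mean(axis=0)
```

After this adjustment both batches share the same per-gene mean, removing the systematic shift while preserving within-batch variation.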

Repeat Previous Steps on the Corrected Data

As the count matrix has changed after batch correction, we need to redo the steps from PCA through clustering:

In [37]:
pg.pca(adata)
pg.neighbors(adata)
pg.louvain(adata)
2019-09-24 12:38:21,833 - sccloud - INFO - PCA is done. Time spent = 1.76s.
2019-09-24 12:38:30,950 - sccloud - INFO - Nearest neighbor search is finished in 9.12s.
2019-09-24 12:38:32,447 - sccloud - INFO - Constructing affinity matrix is done. Time spent = 1.50s.
2019-09-24 12:38:32,455 - sccloud - INFO - Affinity matrix calculation is finished in 1.50s
2019-09-24 12:38:34,442 - sccloud - INFO - Graph is constructed. Time spent = 1.99s.
2019-09-24 12:38:47,942 - sccloud - INFO - Louvain clustering is done. Time spent = 15.49s.

Let's check the composition plot now:

In [38]:
pg.composition_plot(adata, by = 'louvain_labels', condition = 'Channel')
Out[38]:

If everything goes properly, you should see that no cluster is dominated by cells from a single donor. Also notice that the Louvain algorithm on the corrected data finds 16 clusters instead of the original 19.

The FIt-SNE plot is also different:

In [39]:
pg.fitsne(adata)
2019-09-24 12:39:40,905 - sccloud - INFO - FIt-SNE is calculated. Time spent = 52.69s.
In [40]:
pg.embedding(adata, basis = 'fitsne', keys = ['louvain_labels', 'Channel'])
Out[40]:

Similarly, if everything goes properly, you should see that cells from different donors are better mixed in the right-hand plot.

Visualization

tSNE Plot

In previous sections, we have seen data visualization using FIt-SNE, a fast implementation of the tSNE algorithm. Including FIt-SNE, pegasus provides 3 different tSNE methods.

UMAP Plot

Besides tSNE, pegasus also provides UMAP plotting methods.

Below is the UMAP plot of the data, generated with the umap method:

In [41]:
pg.umap(adata)
2019-09-24 12:39:42,215 - sccloud - INFO - UMAP(a=None, angular_rp_forest=False, b=None, init='spectral',
     learning_rate=1.0, local_connectivity=1.0, metric='euclidean',
     metric_kwds=None, min_dist=0.5, n_components=2, n_epochs=None,
     n_neighbors=15, negative_sample_rate=5, random_state=0,
     repulsion_strength=1.0, set_op_mix_ratio=1.0, spread=1.0,
     target_metric='categorical', target_metric_kwds=None,
     target_n_neighbors=-1, target_weight=0.5, transform_queue_size=4.0,
     transform_seed=42, verbose=True)
2019-09-24 12:39:42,215 - sccloud - INFO - Construct fuzzy simplicial set
2019-09-24 12:39:44,168 - sccloud - INFO - Construct embedding
	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs
2019-09-24 12:40:09,588 - sccloud - INFO - UMAP is calculated. Time spent = 27.39s.
In [42]:
pg.embedding(adata, basis = 'umap', keys = ['louvain_labels', 'Channel'])
Out[42]:

Differential Expression Analysis

With the clusters ready, we can now perform Differential Expression (DE) Analysis, which is crucial for identifying the cell type of each cluster.

Now use the de_analysis method to run DE analysis. We use the Louvain result here.

Note that for notebooks running on Terra, you need to explicitly set the temp_folder parameter to avoid a spurious "No space left" error:

In [43]:
pg.de_analysis(adata, cluster = 'louvain_labels', auc = False, t = True, fisher = False, mwu = False,
                temp_folder = "/tmp")
2019-09-24 12:40:29,540 - sccloud - INFO - Collecting basic statistics is done. Time spent = 18.49s.
2019-09-24 12:40:31,134 - sccloud - INFO - Welch's t-test is done. Time spent = 1.57s.
2019-09-24 12:40:31,275 - sccloud - INFO - Differential expression analysis is finished. Time spent = 20.44s.

By default, DE analysis runs

  • auc: Area under ROC (AUROC) and Area under Precision-Recall (AUPR).
  • t: Welch’s t test.

Alternatively, you can also run the following tests by setting their corresponding parameters to True:

  • fisher: Fisher’s exact test.
  • mwu: Mann-Whitney U test.

The DE analysis result is stored with key de_res (by default) in the varm field of the data. See documentation for more details.

To load the result in a human-readable format, use markers method:

In [44]:
marker_dict = pg.markers(adata)

By default, markers:

  • Sort genes by WAD scores in descending order;
  • Use $\alpha = 0.05$ significance level on q-values for inference.

See documentation for customizing these parameters.
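The q-values above are multiple-testing-adjusted p-values. A standard Benjamini-Hochberg adjustment can be sketched as follows; this is an illustration of the general technique, and pegasus's exact correction procedure may differ:

```python
import numpy as np

def bh_qvalues(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value down
    q = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(q, 0.0, 1.0)
    return out

qv = bh_qvalues([0.01, 0.04, 0.03, 0.20])
```

Genes whose q-value falls below the significance level (0.05 by default) are then treated as differentially expressed.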

Let's see the up-regulated genes for Cluster 1:

In [45]:
marker_dict['1']['up']
Out[45]:
mean_logExpr mean_logExpr_other log_fold_change percentage percentage_other percentage_fold_change WAD_score t_pval t_qval
feature
LTB 4.300630 1.946896 2.353734 90.104515 42.100132 2.140243 0.846916 0.000000 0.000000
SARAF 4.537941 3.028602 1.509338 94.150368 71.809067 1.311121 0.657711 0.000000 0.000000
MALAT1 8.956445 8.410056 0.546389 100.000000 99.237518 1.007683 0.546389 0.000000 0.000000
JUNB 4.400054 3.195513 1.204542 88.553604 70.562637 1.254964 0.526906 0.000000 0.000000
EIF3E 4.429431 3.258775 1.170657 93.324341 74.674469 1.249749 0.518327 0.000000 0.000000
... ... ... ... ... ... ... ... ... ...
GCK 0.004845 0.000900 0.003945 0.134862 0.024372 5.533571 0.000002 0.024471 0.042328
DBH-AS1 0.004723 0.000812 0.003911 0.134862 0.024372 5.533571 0.000002 0.021712 0.037794
RP11-83C7.1 0.004364 0.000108 0.004256 0.118004 0.003482 33.893124 0.000002 0.010111 0.018549
RP11-76E12.1 0.003853 0.000246 0.003607 0.101146 0.006963 14.525624 0.000001 0.023376 0.040536
DCDC2B 0.003819 0.000296 0.003523 0.101146 0.010445 9.683749 0.000001 0.024827 0.042899

1313 rows × 9 columns

Among them, TRAC is worth noting: it is a critical marker for T cells.

We can also use a volcano plot to inspect the DE result. Below is such a plot for Cluster 1, with log fold change as the metric:

In [46]:
pg.volcano(adata, cluster_ids = ['1'])
Out[46]:

To store a specific DE analysis result to a file, you can use the write_results_to_excel method in pegasus:

In [47]:
pg.write_results_to_excel(marker_dict, "MantonBM_nonmix_subset.louvain_labels.de.xlsx")

Cell Type Annotation

With DE analysis done, we can use the test results to annotate the clusters.

In [48]:
celltype_dict = pg.infer_cell_types(adata, markers = 'human_immune', de_test = 't')
cluster_names = pg.infer_cluster_names(celltype_dict)

infer_cell_types has 2 critical parameters to set:

  • markers: Either 'human_immune', 'mouse_immune', 'human_brain', 'mouse_brain', or a user-specified marker dictionary.
  • de_test: Decide which DE analysis test to be used for cell type inference. It can be either 't', 'fisher', or 'mwu'.

infer_cluster_names by default uses threshold = 0.5 to filter out candidate cell types of scores lower than 0.5.

See documentation for details.

Next, substitute the inferred cluster names in data:

In [49]:
adata.rename_categories('louvain_labels', cluster_names)

And plot the data with cluster names. You can choose whether or not to print the names on the plot by setting legend to "data":

In [50]:
pg.embedding(adata, basis = 'fitsne', keys = ['louvain_labels'])
Out[50]:
In [51]:
pg.embedding(adata, basis = 'fitsne', keys = ['louvain_labels'], legend='data')
Out[51]:

Gene-specific Plots

Dot Plot

In [52]:
marker_genes = ['CD38', 'JCHAIN', 'FCGR3A', 'HLA-DPA1', 'CD14', 'CD79A', 'MS4A1', 'CD34', 'TRAC', 'CD3D', 'CD8A',
                'CD8B', 'GYPA', 'NKG7', 'CD4', 'SELL', 'CCR7']

pg.dotplot(adata, keys = marker_genes, by = 'louvain_labels')
Out[52]:

Heatmap

In [53]:
pg.heatmap(adata, keys = marker_genes, by = 'louvain_labels')
Out[53]:

Violin Plot

In [54]:
pg.violin(adata, keys = ['TRAC'], by = 'louvain_labels', width = 900, height = 450)
Out[54]:

Feature Plot

In [55]:
pg.embedding(adata, basis = 'fitsne', keys = ['TRAC', 'CD79A', 'CD14', 'CD34'])
Out[55]:

Cell Development Trajectory and Diffusion Map

Alternatively, pegasus provides cell development trajectory plots using the Force-directed Layout (FLE) algorithm.

Moreover, FLE plots are computed on the Diffusion Map of the data rather than directly on the data points, for better efficiency. Thus, we need to first compute the diffusion map structure:

In [56]:
pg.diffmap(adata)
2019-09-24 12:40:39,192 - sccloud - INFO - Calculating connected components is done.
2019-09-24 12:40:39,410 - sccloud - INFO - Calculating normalized affinity matrix is done.
2019-09-24 12:40:48,876 - sccloud - INFO - Detected knee point at t = 188.
2019-09-24 12:40:48,912 - sccloud - INFO - diffmap finished. Time spent = 9.76s.

By default, diffmap method uses:

  • Number of Diffusion Components = $100$.
  • Compute diffusion map from PCA matrix.
In [57]:
adata.obsm['X_diffmap'].shape
Out[57]:
(34654, 99)
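The idea behind the diffusion map can be sketched as follows: build a Gaussian affinity matrix over cells, normalize it, and use the leading non-trivial eigenvectors as new coordinates. A toy NumPy illustration (simplified assumptions throughout, not pegasus's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))                     # toy PCA matrix: cells x PCs

# Gaussian affinity between cells
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-d2 / d2.mean())

# Symmetrically normalized affinity matrix
deg = W.sum(axis=1)
A = W / np.sqrt(deg[:, None] * deg[None, :])

evals, evecs = np.linalg.eigh(A)                 # eigenvalues in ascending order
# evecs[:, -1] is the trivial eigenvector (eigenvalue 1); skip it
diffmap = evecs[:, -4:-1][:, ::-1]               # top 3 non-trivial components
```

Skipping the trivial eigenvector is also why the stored embedding above has 99 columns rather than 100.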

Now we are ready to calculate the cell development trajectory. Below is the FLE plot using the net_fle method:

In [58]:
pg.net_fle(adata)
2019-09-24 12:40:55,297 - sccloud - INFO - Nearest neighbor search is finished in 6.36s.
2019-09-24 12:40:56,081 - sccloud - INFO - Constructing affinity matrix is done. Time spent = 0.78s.
2019-09-24 12:40:56,084 - sccloud - INFO - Affinity matrix calculation is finished in 0.79s
2019-09-24 12:40:56,089 - sccloud - INFO - select_cells finished. Time spent = 0.0034s.
2019-09-24 12:40:56,625 - sccloud - INFO - Constructing affinity matrix is done. Time spent = 0.08s.
2019-09-24 12:40:56,683 - sccloud - INFO - Graph is constructed. Time spent = 0.06s.
/var/folders/9y/7cx5sqf153gc_x672pscmn2c0000gq/T/tmpy1t1mi8d.small.net is written.
['java', '-Djava.awt.headless=true', '-Xmx8g', '-cp', '//anaconda3/envs/sccloud/lib/python3.7/site-packages/forceatlas2/ext/forceatlas2.jar://anaconda3/envs/sccloud/lib/python3.7/site-packages/forceatlas2/ext/gephi-toolkit-0.9.2-all.jar', 'kco.forceatlas2.Main', '--input', '/var/folders/9y/7cx5sqf153gc_x672pscmn2c0000gq/T/tmpy1t1mi8d.small.net', '--output', '/var/folders/9y/7cx5sqf153gc_x672pscmn2c0000gq/T/tmpy1t1mi8d.small.coords', '--nthreads', '16', '--seed', '0', '--targetChangePerNode', '2.0', '--targetSteps', '5000', '--2d']
Force-directed layout is generated.
2019-09-24 12:41:06,538 - sccloud - INFO - 0.14763095106845114
2019-09-24 12:41:06,639 - sccloud - INFO - Deep regressor traning and predicting finished. Time spent = 5.41s.
2019-09-24 12:41:07,463 - sccloud - INFO - Graph is constructed. Time spent = 0.82s.
/var/folders/9y/7cx5sqf153gc_x672pscmn2c0000gq/T/tmpy1t1mi8d.net is written.
['java', '-Djava.awt.headless=true', '-Xmx8g', '-cp', '//anaconda3/envs/sccloud/lib/python3.7/site-packages/forceatlas2/ext/forceatlas2.jar://anaconda3/envs/sccloud/lib/python3.7/site-packages/forceatlas2/ext/gephi-toolkit-0.9.2-all.jar', 'kco.forceatlas2.Main', '--input', '/var/folders/9y/7cx5sqf153gc_x672pscmn2c0000gq/T/tmpy1t1mi8d.net', '--output', '/var/folders/9y/7cx5sqf153gc_x672pscmn2c0000gq/T/tmpy1t1mi8d.coords', '--nthreads', '16', '--seed', '0', '--targetChangePerNode', '2.0', '--targetSteps', '1500', '--2d', '--coords', '/var/folders/9y/7cx5sqf153gc_x672pscmn2c0000gq/T/tmpy1t1mi8d.init.coords.txt']
Force-directed layout is generated.
2019-09-24 12:43:04,735 - sccloud - INFO - Net FLE is calculated. Time spent = 135.81s.
In [59]:
pg.embedding(adata, basis = 'net_fle', keys = ['louvain_labels'], width = 700, height = 700)
Out[59]:

(Advanced) More on Clustering

In previous sections, we used the Louvain algorithm for clustering. Including Louvain, pegasus provides 4 algorithms:

  • pg.louvain: Louvain algorithm, using louvain-igraph package.
  • pg.leiden: Leiden algorithm, using leidenalg package.
  • pg.spectral_louvain: Spectral Louvain algorithm, which requires Diffusion Map.
  • pg.spectral_leiden: Spectral Leiden algorithm, which requires Diffusion Map.

We have used the louvain method; its documentation is here. By default, it uses resolution 1.3, clusters on the PCA matrix, and stores the result with key louvain_labels in the obs field of the data.

Leiden Algorithm. The leiden method (see details), by default, uses resolution 1.3, clusters on the PCA matrix, runs until reaching the optimal clustering (controlled by the n_iter parameter), and stores the result with key leiden_labels in the obs field of the data.

However, with the default n_iter, the algorithm can take a very long time to converge while the result doesn't improve much. Instead, you can pass a small positive number to put a hard cap on the number of iterations. Here, we use 2:

In [60]:
pg.leiden(adata, n_iter = 2)
2019-09-24 12:43:08,051 - sccloud - INFO - Graph is constructed. Time spent = 2.19s.
2019-09-24 12:43:14,945 - sccloud - INFO - Leiden clustering is done. Time spent = 9.08s.

We can compare its result with Louvain's via UMAP:

In [61]:
pg.embedding(adata, basis = 'umap', keys = ['louvain_labels', 'leiden_labels'])
Out[61]:

Spectral Louvain Algorithm. The spectral_louvain method (see details), by default, uses resolution 1.3, runs KMeans on the Diffusion Map, clusters on the PCA matrix, and stores the result with key spectral_louvain_labels in the obs field of the data.

In [62]:
pg.spectral_louvain(adata)
2019-09-24 12:43:40,655 - sccloud - INFO - run_multiple_kmeans finished in 23.81s.
2019-09-24 12:43:42,391 - sccloud - INFO - Graph is constructed. Time spent = 1.73s.
2019-09-24 12:43:42,896 - sccloud - INFO - Spectral Louvain clustering is done. Time spent = 26.05s.

Spectral Leiden Algorithm. The spectral_leiden method (see details), by default, uses resolution 1.3, runs KMeans on the Diffusion Map, clusters on the PCA matrix, and stores the result with key spectral_leiden_labels in the obs field of the data.

In [63]:
pg.spectral_leiden(adata)
2019-09-24 12:43:48,741 - sccloud - INFO - run_multiple_kmeans finished in 5.67s.
2019-09-24 12:43:50,471 - sccloud - INFO - Graph is constructed. Time spent = 1.73s.
2019-09-24 12:43:51,230 - sccloud - INFO - Spectral Leiden clustering is done. Time spent = 8.16s.

Below are the UMAP plots of the clustering results of these two algorithms:

In [64]:
pg.embedding(adata, basis = 'umap', keys = ['spectral_louvain_labels', 'spectral_leiden_labels'])
Out[64]:

Save Results

In [65]:
pg.write_output(adata, 'tutorial_results')
2019-09-24 12:44:07,896 - sccloud - INFO - Write output is finished. Time spent = 14.17s.